Skip to content

Conversation

@Xiaoming-AMD
Copy link
Collaborator

@Xiaoming-AMD Xiaoming-AMD commented Nov 13, 2025

  • Create runner/helpers/envs/ directory with modular configuration

    • base_env.sh: Base configuration (logging, cluster info, pythonpath)
    • primus-env.sh: Main entry point with layered loading
    • MI300X.sh, MI325X.sh, MI355X.sh: GPU-specific configurations
    • get_ip_interface.sh, get_nccl_ib_hca.sh: Network detection utilities
  • Refactor configuration loading with clear responsibilities

    • Implement dependency checks between config files
    • Add validation with validate_distributed_params()
    • Support debug mode (PRIMUS_DEBUG=1)
    • Support validation skip (PRIMUS_SKIP_VALIDATION=1)
  • Add comprehensive unit tests for primus-env.sh

    • 10 test cases covering all core functionality
    • Tests for validation, debug mode, defaults, error detection
    • 100% test pass rate (10/10)
  • Update test runner to include new test suite

    • Add test_primus_env.sh to run_all_tests.sh

Architecture improvements:

  • Clear separation of concerns (base, network, perf, gpu-specific)
  • Better maintainability with modular design
  • Robust error handling and validation
  • Production-ready configuration system

- Create runner/helpers/envs/ directory with modular configuration
  - base_env.sh: Base configuration (logging, cluster info, pythonpath)
  - common_network.sh: Network and communication settings (NCCL, RCCL)
  - perf_tuning.sh: Performance tuning and optimizations
  - primus-env.sh: Main entry point with layered loading
  - detect_gpu.sh: GPU model detection
  - MI300X.sh, MI325X.sh, MI355X.sh: GPU-specific configurations
  - get_ip_interface.sh, get_nccl_ib_hca.sh: Network detection utilities

- Refactor configuration loading with clear responsibilities
  - Implement dependency checks between config files
  - Add validation with validate_distributed_params()
  - Support debug mode (PRIMUS_DEBUG=1)
  - Support validation skip (PRIMUS_SKIP_VALIDATION=1)

- Move RCCL configuration from perf_tuning.sh to common_network.sh
  - Better categorization: communication vs performance
  - Unified location for NCCL and RCCL settings

- Rename env_common_network.sh to common_network.sh
  - Consistent naming with other config files

- Move validation function to runner/lib/validation.sh
  - Eliminate duplicate code with DRY principle
  - Remove unnecessary wrapper function (YAGNI, KISS)

- Add comprehensive unit tests for primus-env.sh
  - 10 test cases covering all core functionality
  - Tests for validation, debug mode, defaults, error detection
  - 100% test pass rate (10/10)

- Update test runner to include new test suite
  - Add test_primus_env.sh to run_all_tests.sh

Architecture improvements:
- Clear separation of concerns (base, network, perf, gpu-specific)
- Better maintainability with modular design
- Robust error handling and validation
- Production-ready configuration system
@Xiaoming-AMD Xiaoming-AMD changed the title feat(envs): refactor environment configuration with layered design feature(cli): refactor environment configuration with layered design Nov 13, 2025
Temporarily disable GPU-specific configurations in MI300X, MI325X, and MI355X.
Keep as documentation template for future use.
Merge common_network.sh, perf_tuning.sh, and detect_gpu.sh into base files
to reduce file count and simplify configuration structure.

Changes:
- Merge common_network.sh and perf_tuning.sh into base_env.sh
- Merge detect_gpu.sh into primus-env.sh
- Delete redundant files: common_network.sh, perf_tuning.sh, detect_gpu.sh
- Simplify primus-env.sh loading order (2 layers instead of 4)

Result:
- File count: 10 → 7 files (-30%)
- Cleaner directory structure
- Faster loading (fewer source calls)
- Better integration (GPU detection uses logging functions)
- Improved error handling (GPU detection doesn't exit on failure)
@Xiaoming-AMD Xiaoming-AMD merged commit bdcc2de into main Nov 13, 2025
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants